
The Great AI Challenge: We Test Five Top Bots on Useful, Everyday Skills

🌈 Abstract

The article compares and evaluates five AI chatbots (ChatGPT, Claude, Copilot, Gemini, and Perplexity) across categories such as health, finance, cooking, work writing, creative writing, summarization, current events, and coding. It highlights each chatbot's unique strengths and weaknesses and provides insights into their performance.

🙋 Q&A

[01] Meet the models

1. What are the key features of the AI chatbots discussed in the article?

  • ChatGPT by OpenAI is celebrated for its versatility and ability to remember user preferences.
  • Anthropic's Claude is geared to be inoffensive.
  • Microsoft's Copilot leverages OpenAI's technology and integrates with services like Bing and Microsoft 365.
  • Google's Gemini draws on the company's own popular search engine for real-time responses.
  • Perplexity is a research-focused chatbot that cites sources with links and stays up to date.

2. How did the authors assess the capabilities of these chatbots?

  • The authors used the $20-a-month paid versions of the chatbots to assess their full capabilities across a wide range of tasks.
  • They crafted a series of prompts to test popular use cases, including coding challenges, health inquiries, and money questions.
  • The results were judged by Journal newsroom editors and columnists, who rated the chatbots on accuracy, helpfulness, and overall quality.

[02] Health

1. What were the key findings regarding the chatbots' performance in the health category?

  • Many of the chatbots' answers on health-related topics sounded similar.
  • Gemini gave a brief, general recommendation when asked about the best age to get pregnant, while Perplexity provided a more detailed response.
  • ChatGPT's answers improved with the recent GPT-4o update, and it finished as the category winner.

[03] Finance

1. How did the chatbots perform in the finance category?

  • Claude had the best answers for the Roth vs. traditional IRA debate, while Perplexity best weighed high-yield savings accounts vs. CDs.
  • Gemini provided the best answer to a question about when to withdraw funds from an inherited $1 million IRA.
  • ChatGPT and Copilot fell behind in this category.

[04] Cooking

1. What were the key findings in the cooking category?

  • ChatGPT, the category winner, provided a creative but realistic menu in response to a prompt with random ingredients.
  • Perplexity impressed with the detailed cooking steps it provided alongside its own clever menu.
  • Gemini took the cake when asked to provide a recipe for a chocolate dessert that addresses many dietary restrictions, while Copilot failed by including eggs and butter.

[05] Work writing

1. How did the chatbots perform in the work writing category?

  • Perplexity nailed the job posting for a "prompt engineer" with the right mix of journalism and AI bot knowledge.
  • Copilot missed the mark by never mentioning prompt engineering at all.
  • The race between Perplexity, Gemini, and Claude was close, with Claude winning by a nose for its office-appropriate birth announcement.

[06] Creative writing

1. How did the chatbots perform in the creative writing category?

  • Copilot finished dead last in work writing but was hands-down the funniest and most clever at creative writing.
  • Claude was the second-best in creative writing, with clever zingers about both presidential challengers.
  • Perplexity made a rare flub, erroneously attributing a lyric from the 2011 musical film "The Muppets" to Kermit.

[07] Summarization

1. What were the key findings in the summarization category?

  • Perplexity consistently summarized things well, including a YouTube video, which it handled by skimming the subtitles.
  • Copilot answered in a skimmable outline format and included lesser-known fun facts when summarizing a Wikipedia page.
  • Gemini was unable to handle web links, and neither could the premium Claude account.

[08] Current events

1. How did the chatbots perform in the current events category?

  • Perplexity stayed on top with balanced reasoning and solid sourcing when asked about the upcoming presidential election.
  • ChatGPT faltered when first tested but improved with the GPT-4o upgrade, moving into second place.
  • Gemini declined to answer the election question.

[09] Coding

1. How did the chatbots perform in the coding category?

  • All the chatbots did fairly well with coding, according to the blind judging by the Journal's data journalist.
  • Perplexity managed to eke out a win, followed by ChatGPT and Gemini.

[10] Speed

1. How did the chatbots perform in the speed tests?

  • ChatGPT with the GPT-4o update was the fastest, clocking in at 5.8 seconds for the "Explain Einstein's theory of relativity in five sentences" prompt (a rough sketch of this kind of wall-clock timing follows this list).
  • Claude and Perplexity were much slower than the other three chatbots throughout the tests.
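
The article reports only the results of the speed tests, not how they were timed. As a purely illustrative sketch (the ChatbotClient below is a hypothetical stand-in, since the piece names no specific API), wall-clock timing of a single prompt could look like this:

```python
import time


class ChatbotClient:
    """Hypothetical stand-in for any chatbot's API; not taken from the article."""

    def ask(self, prompt: str) -> str:
        time.sleep(0.5)  # pretend the model takes half a second to reply
        return "A five-sentence explanation of relativity..."


def time_response(client: ChatbotClient, prompt: str) -> float:
    """Measure seconds from sending a prompt to receiving the complete reply."""
    start = time.perf_counter()
    client.ask(prompt)
    return time.perf_counter() - start


if __name__ == "__main__":
    prompt = "Explain Einstein's theory of relativity in five sentences"
    print(f"Response time: {time_response(ChatbotClient(), prompt):.1f} seconds")
```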

[11] Overall results

1. What were the key takeaways from the overall evaluation of the chatbots?

  • Each chatbot has unique strengths and weaknesses, making them all worth exploring.
  • The bots provided mostly helpful answers and avoided controversy, with few outright errors or "hallucinations."
  • Perplexity, a lesser-known chatbot, was the overall champion, consistently delivering concise answers that zeroed in on the most essential information.
  • The big tech players, Microsoft and Google, did not necessarily have an advantage, with Copilot and Gemini fighting hard to stay in the game.
  • With AI developing rapidly, the chatbots may continue to leapfrog one another in the foreseeable future.